Abstract: Speech activity detection (SAD) is an essential
component of a variety of speech processing applications. It
has been observed that the performance of various speech based
tasks depends strongly on the efficiency of the SAD.
In this paper, we systematically review some popular
SAD techniques and their applications in speaker recognition.
Speaker verification systems using different SAD techniques are
experimentally evaluated on NIST speech corpora using a Gaussian
mixture model-universal background model (GMM-UBM) based
classifier for clean and noisy conditions. It has been found that
two-Gaussian modeling based SAD is comparatively better than
other SAD techniques for different types of noise.

Index Terms: Voice activity detection (VAD), speech activity
detection (SAD), G.729B, bi-Gaussian modeling, NOISEX-92.



I. Introduction 

Speech activity detection (SAD) is an important task in
most speech processing applications. The function of a
SAD is to separate silence and other non-speech frames from the
speech frames of a signal. It has been found that the presence of
non-speech frames considerably affects the performance of the
system. Speech activity detection is also called voice activity
detection. However, because voice and speech are not the same in
speech processing terminology, we refer to the task of identifying
speech frames as SAD throughout this work. SAD techniques are
designed using various methods. Most of them use heuristically
chosen statistical properties of speech parameters such as energy,
pitch, and entropy. Therefore, the performance of different SADs
differs and varies with the type of noise and the
signal-to-noise ratio (SNR). As a result, the performance of
speech based systems is significantly sensitive to the
employed SAD technique, and the SAD should be carefully
chosen while designing a speech based system. Speech activity
detection has been rigorously studied for speech recognition,
speech coding, etc. However, it has not yet been thoroughly studied
for speaker recognition applications. Only very recently has it
drawn the attention of researchers in this field [1]-[3].

In current speaker recognition systems, energy based SADs
are predominantly used. For example, Kinnunen et al.
employed an energy based SAD which is found useful for NIST
speech data [4]. The baseline speaker recognition system
developed in [5] also uses a different energy based SAD. A
bi-Gaussian model of the log-energy distribution of speech frames
is suggested in [6]. More recently, different SADs have been used
for different qualities of speech signal. For example, in [2],
the Hungarian phoneme recognizer developed at BUT is used for
speech frame selection in telephone quality speech, while
a GMM based approach is used for microphone
quality speech [8]. In addition, the voice activity detector
used in G.729B and the statistical voice activity detector proposed
by Sohn et al. are also used for speech activity detection in
speaker recognition. In this paper, we briefly review all of these
techniques. Their performance is then compared on two
popular NIST speech corpora under clean and noisy environmental
conditions. The performance is also evaluated in a simulated
real-time situation where the speech utterances used for training
and testing are distorted with various noises at different SNRs.
In most cases, we have found that speaker recognition
systems with two-Gaussian modeling based SADs are significantly
better than those using other techniques.

The rest of the paper is organized as follows. In Section II,
the existing speech activity detectors are briefly reviewed.
The experimental setup used in this paper is discussed in
Section III. The results obtained in different experiments are
discussed in Section IV. Finally, the paper is concluded in
Section V.

II. Review of Some Popular SAD Techniques 

Most speech activity detectors are based on either a
time domain or a frequency domain approach. Time
domain features such as short-time average energy (STAE),
short-time average magnitude (STAM), and zero-crossing rate (ZCR)
are used in the time domain. In the frequency
domain, on the other hand, various kinds of spectral information
are used for designing a SAD. There are numerous examples where
time and frequency domain information and their statistical
properties are used to develop robust speech activity detectors
[9]-[14]. A speech activity detector based on a periodicity measure
of the speech signal is used in [15]. A cepstral information based
SAD is proposed in [16]. In [17], a SAD based on long-term speech
information is proposed for automatic speech recognition.
Transform domain characteristics of the speech signal are used
to design a SAD in [18]. An entropy based SAD is proposed
in [19]. Divergence of subband information is utilized in [20].
Recently, modulation spectrum information in terms of the
delta-phase spectrum has been used to design a robust voice
activity detector for robust speaker recognition [21].

A concise and up-to-date review of the existing SAD techniques
is not yet available. However, some older surveys exist
in this domain. For example, a class of frequency domain
voice activity detectors used for VoIP speech compression is
compared in [22]. In [23], three popular voice activity detectors
from the speech coding domain are experimentally evaluated
and assessed using different subjective and objective indices.
VADs are compared in the wireless application domain in [24]. As
there is an enormous amount of work in this domain, it is quite
difficult to evaluate all of it in a single work. In this paper,
we review five popular SAD techniques which are commonly
used in state-of-the-art speaker recognition systems.

A. SAD Used in G.729B 

The voice activity detection used in the G.729 codec is widely used
in different speech processing applications. This technique
classifies active and inactive voice frames with the help of
multiple speech parameters in the following steps:

• First, four parameters are computed for each frame of the
given speech signal: line spectral frequencies (LSFs), full-band
energy, low-band energy, and the zero-crossing rate.

• The parameters of the initial frames of the signal are taken
as background noise information, and distortion is measured
with respect to those background parameters.

• Initial VAD decisions are made using multiple decision
boundaries in the space of the four parameters.

• The initial VAD decisions are smoothed in four steps in order
to reflect the stationary nature of the speech and background
signals.

The VAD algorithm is terminated after the estimated background
noise information crosses a pre-defined threshold.

B. Statistical SAD [10]

In this SAD technique, a statistical model is used and
the decisions are taken based on a likelihood ratio test (LRT).
In addition, a decision-directed (DD) method is used
to estimate various parameters. Finally, it also introduces an
improved hang-over scheme based on a hidden Markov model
(HMM). The purpose of the hang-over scheme is to detect
speech frames which are almost buried in noise. It has been shown
that this approach performs significantly better than the G.729B
based SAD at low SNR for speech frame detection.
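The likelihood ratio test can be sketched as follows. This is only an illustration under stated assumptions, not the implementation evaluated in this paper: it uses the Gaussian-model form of the per-bin likelihood ratio commonly given for this family of detectors, replaces the amplitude estimator of the decision-directed step with a simple Wiener gain, and omits the HMM hang-over scheme. The function names, the smoothing constant `alpha` and the decision `threshold` are hypothetical.

```python
import math


def frame_llr(noisy_power, noise_var, prev_clean_power=None, alpha=0.99):
    """Return (mean log likelihood ratio, clean-power estimate) for one frame.

    noisy_power[k] is |X_k|^2 for frequency bin k and noise_var[k] is the
    noise variance of that bin (assumed known, e.g. from leading frames).
    """
    llr_sum = 0.0
    clean_power = []
    for k, (power, lam) in enumerate(zip(noisy_power, noise_var)):
        gamma = power / lam                   # a posteriori SNR
        xi = max(gamma - 1.0, 0.0)            # maximum-likelihood a priori SNR
        if prev_clean_power is not None:
            # Decision-directed smoothing with the previous frame's estimate.
            xi = alpha * prev_clean_power[k] / lam + (1.0 - alpha) * xi
        # log of (1 / (1 + xi)) * exp(gamma * xi / (1 + xi))
        llr_sum += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
        # Simplification (assumption): Wiener gain as the amplitude estimator.
        gain = xi / (1.0 + xi)
        clean_power.append(gain * gain * power)
    return llr_sum / len(noisy_power), clean_power


def is_speech(noisy_power, noise_var, threshold=0.5, prev_clean_power=None):
    """Speech when the mean log likelihood ratio exceeds the threshold."""
    llr, clean_power = frame_llr(noisy_power, noise_var, prev_clean_power)
    return llr > threshold, clean_power


if __name__ == "__main__":
    noise_var = [1.0, 1.0, 1.0]
    print(is_speech([1.0, 1.0, 1.0], noise_var)[0])     # noise-like frame
    print(is_speech([10.0, 12.0, 8.0], noise_var)[0])   # high-energy frame
```

In a frame-by-frame loop, the second return value would be passed back as `prev_clean_power` for the next frame so that the a priori SNR estimate is smoothed over time.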

C. Energy Based SAD [4], [5]

Energy based SAD techniques are very straightforward, and
they are widely used in speech and speaker recognition
applications. First, the energy of every frame is computed
for a given speech utterance. Then, an empirical threshold
is selected from the frame energies. In [4], the threshold is
determined from the maximum energy of the speech frames.
In [5], the threshold is selected as 0.06 × Eavg, where Eavg
is the average energy of the frames of a speech utterance. These
kinds of techniques are reasonably suitable for clean conditions,
but the performance of the system degrades significantly at
low SNR.
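A minimal sketch of the average-energy variant is given below. The factor 0.06 comes from the text; the frame length and hop (160 and 80 samples, i.e. 20 ms frames with 50% overlap at 8 kHz), the use of the sum of squares as frame energy, and the function names are assumptions made for illustration.

```python
import math


def frame_energies(samples, frame_len, hop):
    """Short-time energy (sum of squares) of each full frame."""
    return [
        sum(x * x for x in samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def average_energy_sad(samples, frame_len=160, hop=80, factor=0.06):
    """Mark a frame as speech when its energy exceeds factor * E_avg."""
    energies = frame_energies(samples, frame_len, hop)
    if not energies:
        return []
    threshold = factor * (sum(energies) / len(energies))
    return [energy > threshold for energy in energies]


if __name__ == "__main__":
    # Toy signal: 0.1 s of near-silence followed by 0.1 s of a louder tone.
    quiet = [0.001 * math.sin(0.3 * n) for n in range(800)]
    loud = [0.5 * math.sin(0.3 * n) for n in range(800)]
    labels = average_energy_sad(quiet + loud)
    print(sum(labels), "of", len(labels), "frames kept")
```

Because the threshold is tied to the utterance's own average energy, additive noise raises both the frame energies and the threshold, which is one way to see why this rule becomes unreliable at low SNR.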

D. Phoneme Recognizer Based SAD [25]

State-of-the-art speaker recognition systems use the Hungarian
phoneme recognizer to mark speech and non-speech
frames for telephone quality speech signals [26]. This
tool labels each segment as speech or non-speech.
The speech segments are assigned to different
phonemes. Non-speech segments, on the other hand, are
of four kinds: (i) pau: pause within the speech signal, (ii) spk:
speaker related noises, (iii) sta: stationary noise, and (iv) int:
intermittent noise. This tool uses a neural network and hidden
Markov model based technique to identify phonemes. The neural
network is trained on speech frames with their target
labels, which are usually obtained from a standard database.

Fig. 1. Bi-Gaussian model of log-energy values of speech frames. The Gaussian
distribution function in red represents the speech component and the one in blue
represents the noise/background component. Here, θ is the threshold.

E. Bi-Gaussian Modeling of Log-Energies Based SAD

Bi-Gaussian modeling for speech frame selection is briefly
explained in [6], [28]. In this method, the log-energy
of each frame of a speech utterance is first computed.
Then the distribution of the log-energy values is estimated
using a Gaussian mixture model with two mixtures. The cluster
with the smaller center is treated as the noise
or non-speech class, and the cluster with the larger
center is considered as speech (Fig. 1). A threshold is
computed to determine the decision boundary between
the speech and non-speech classes. Usually, it is chosen as the
point between the two centers where the probabilities are equal.
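The procedure can be illustrated with a short sketch that fits a two-component GMM to frame log-energies by EM and then searches for the equal-probability point between the centers. The initialisation, the iteration count, the variance floor, the bisection search and the midpoint fallback are all assumptions chosen for the example, not details taken from [6] or [28].

```python
import math


def log_weighted_gauss(x, weight, mean, var):
    """Log of weight * N(x; mean, var)."""
    return math.log(weight) - 0.5 * (math.log(2.0 * math.pi * var)
                                     + (x - mean) ** 2 / var)


def fit_two_gaussians(values, n_iter=100, var_floor=1e-6):
    """EM for a two-component 1-D GMM; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    center = sum(values) / len(values)
    spread = max(sum((v - center) ** 2 for v in values) / len(values), var_floor)
    weights = [0.5, 0.5]
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [spread, spread]
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for every value.
        resp = []
        for v in values:
            a = log_weighted_gauss(v, weights[0], means[0], variances[0])
            b = log_weighted_gauss(v, weights[1], means[1], variances[1])
            resp.append(1.0 / (1.0 + math.exp(min(a - b, 700.0))))
        # M-step: re-estimate weight, mean and variance of each component.
        for j in (0, 1):
            r = [p if j == 1 else 1.0 - p for p in resp]
            n_j = max(sum(r), 1e-12)
            means[j] = sum(p * v for p, v in zip(r, values)) / n_j
            variances[j] = max(
                sum(p * (v - means[j]) ** 2 for p, v in zip(r, values)) / n_j,
                var_floor)
            weights[j] = max(n_j / len(values), 1e-12)
    return weights, means, variances


def speech_threshold(weights, means, variances):
    """Point between the two centers where the weighted densities are equal."""
    noise, speech = (0, 1) if means[0] <= means[1] else (1, 0)

    def diff(x):  # positive where the speech component dominates
        return (log_weighted_gauss(x, weights[speech], means[speech], variances[speech])
                - log_weighted_gauss(x, weights[noise], means[noise], variances[noise]))

    left, right = means[noise], means[speech]
    if diff(left) >= 0.0 or diff(right) <= 0.0:
        return 0.5 * (left + right)   # assumption: midpoint if no crossing
    for _ in range(60):               # bisection on the sign change
        mid = 0.5 * (left + right)
        if diff(mid) < 0.0:
            left = mid
        else:
            right = mid
    return 0.5 * (left + right)
```

Frames whose log-energy exceeds the returned threshold would be kept as speech; since the model is fitted per utterance, the threshold adapts to the noise floor of each recording.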

III. Experimental Setup 

A. Database Description 

Speaker recognition experiments have been conducted on
NIST SRE 2001 [29] and NIST SRE 2002 [30]. In the literature,
noise related experiments are mostly
performed on these databases. These two datasets have less
variability due to channel and handset than the latest
NIST corpora. Therefore, synthetic addition of noise to these
datasets is better justified for observing the effect of adverse
environmental conditions. We have performed all
the experiments on the core task condition of the evaluation plan.
A detailed description of the databases is given in Table I. In
both cases, we have used the development data of NIST
SRE 2001 for UBM preparation.



http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-cont
http://www.fee.vutbr.cz/SP
TABLE I

Database description (core test section) for the performance
evaluation of various speech activity detectors.

                   SRE 2001               SRE 2002
Target Models      74 male, 100 female    139 male, 191 female
Test Segments      2038                   3570
Total Trials       22418                  39270
True Trials        2038                   2983
Impostor Trials    20380                  36287


B. Feature Extraction 

MFCC features have been extracted using 20 filters linearly
spaced on the mel scale from speech frames of 20 ms with
50% overlap between adjacent frames. The details of our feature
extraction process can be found in [31]. We have also shown
the performance of our previously proposed overlapped block
transform coefficient (OBTC) feature, termed OBT-9-13,
which is extracted by performing a distributed DCT on the mel
filterbank output [31]. The dimensions of the MFCC and
OBTC features are 38 and 40, respectively, after the static
features are augmented with their velocity coefficients.

C. Classifier Description 

In this paper, all results are obtained with a GMM-UBM
based classifier in which the target models are created by
adapting the UBM parameters [32]. Initially, a gender dependent UBM
with 256 mixtures is trained using the expectation-maximization
algorithm, after initialization by split vector quantization,
with data from the development section of NIST SRE 2001. Target
models are created by adapting the centres of the UBM with
a relevance factor of 14. Finally, during score computation, only
the top-5 Gaussians of the UBM per frame are considered.
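The adaptation of the centres can be illustrated with the mean-only MAP update commonly used with GMM-UBM systems [32]: for mixture i with soft count n_i and posterior-weighted data mean E_i[x], the adapted centre is alpha_i E_i[x] + (1 - alpha_i) mu_i with alpha_i = n_i / (n_i + r). The sketch below assumes diagonal covariances and a toy UBM held in plain lists; only the relevance factor r = 14 comes from the text, and the function names are hypothetical.

```python
import math


def component_posteriors(x, weights, means, variances):
    """Posterior probability of each diagonal-covariance Gaussian for frame x."""
    logs = []
    for w, mean, var in zip(weights, means, variances):
        log_pdf = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        logs.append(math.log(w) + log_pdf)
    peak = max(logs)                      # subtract the peak for stability
    scaled = [math.exp(value - peak) for value in logs]
    total = sum(scaled)
    return [s / total for s in scaled]


def map_adapt_means(frames, weights, means, variances, relevance=14.0):
    """Return the MAP-adapted centres; weights and variances stay unchanged."""
    n_mix, dim = len(means), len(means[0])
    counts = [0.0] * n_mix
    sums = [[0.0] * dim for _ in range(n_mix)]
    for x in frames:
        for i, post in enumerate(component_posteriors(x, weights, means, variances)):
            counts[i] += post
            for d in range(dim):
                sums[i][d] += post * x[d]
    # alpha_i * E_i[x] + (1 - alpha_i) * mu_i with alpha_i = n_i / (n_i + r)
    # simplifies to (sum_i + r * mu_i) / (n_i + r), which is safe when n_i = 0.
    return [
        [(sums[i][d] + relevance * means[i][d]) / (counts[i] + relevance)
         for d in range(dim)]
        for i in range(n_mix)
    ]


if __name__ == "__main__":
    ubm_weights = [0.5, 0.5]
    ubm_means = [[0.0], [10.0]]
    ubm_vars = [[1.0], [1.0]]
    frames = [[1.0]] * 14                 # 14 frames close to the first centre
    print(map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars))
```

With 14 frames assigned to one mixture and r = 14, alpha is 0.5, so the adapted centre lies halfway between the UBM centre and the data mean; mixtures that see no data keep their UBM centre.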

D. Performance Evaluation Metric 

The performance of speaker recognition is measured using
the equal error rate (EER), which is the operating
point on the detection error tradeoff plot where the probability
of false rejection (FR) equals the probability of false acceptance
(FA). The performance is also measured in terms of the minimum
detection cost function (minDCF), where a cost function that
assigns unequal costs to FR and FA is minimized. Here, the
costs of FR and FA are set at 1 and 10, respectively.
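Both metrics can be computed from lists of target and impostor scores, as in the sketch below. It assumes the usual weighted form of the detection cost, C_FR P_FR P_target + C_FA P_FA (1 - P_target); the target prior P_target is not stated in the text, so it is passed in as an argument together with the two costs. The EER is approximated at the threshold where the two error rates are closest. Function names are hypothetical.

```python
def error_rates(target_scores, impostor_scores):
    """Return (threshold, P_FR, P_FA) for every candidate threshold.

    A trial is accepted when its score is greater than or equal to the
    threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)   # operating point that rejects all
    points = []
    for th in thresholds:
        p_fr = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        points.append((th, p_fr, p_fa))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: operating point where |P_FR - P_FA| is smallest."""
    _, p_fr, p_fa = min(error_rates(target_scores, impostor_scores),
                        key=lambda point: abs(point[1] - point[2]))
    return 0.5 * (p_fr + p_fa)


def min_dcf(target_scores, impostor_scores, c_fr, c_fa, p_target):
    """Minimum over thresholds of the weighted detection cost."""
    return min(
        c_fr * p_fr * p_target + c_fa * p_fa * (1.0 - p_target)
        for _, p_fr, p_fa in error_rates(target_scores, impostor_scores)
    )


if __name__ == "__main__":
    targets = [2.0, 3.0, 4.0, 1.0]
    impostors = [0.0, 0.5, 1.5, 2.5]
    print(equal_error_rate(targets, impostors))
    # Equal costs and prior are used here purely for illustration.
    print(min_dcf(targets, impostors, c_fr=1.0, c_fa=1.0, p_target=0.5))
```

The exhaustive threshold sweep is quadratic in the number of trials; it is kept for clarity, and a sorted single pass would be preferable for the trial counts of Table I.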

IV. Results and Discussions 

Speaker recognition experiments are conducted on both
databases using the MFCC and OBTC features. In all the
experiments, everything other than the SAD technique is
kept fixed to observe the effect of the SAD. In this paper, the
SAD techniques are abbreviated as follows:

• G.729B: SAD used in G.729B [9].

• SMSAD: Statistical model based SAD proposed by Sohn
et al. [10].

• HNPNR: Hungarian phoneme recognizer based
SAD [25].

• MEBTS: Energy dependent SAD using maximum energy
based threshold selection [4].

• AEBTS: Energy dependent SAD using average energy
based threshold selection [5].

• UBGME: Utterance-wise bi-Gaussian modeling of log-energy
based threshold selection [28].

A. Results in Match Condition 

Experiments are first conducted on the clean speech databases.
Speaker verification results with different SADs are shown in
Table II. The frame selection technique used in G.729B
performs worst for speaker recognition in almost
all cases. The performance of SMSAD is better than
that of G.729B in most cases. The HNPNR
based SAD outperforms the first two techniques.
However, the MEBTS based approach is better than or comparable to
those techniques in almost all cases. The bi-Gaussian modeling
based approach (UBGME) then gives the lowest EER with MFCC and
the lowest minDCF with OBTC on NIST SRE 2001, although MEBTS
and AEBTS remain slightly better on the other two measures.
However, its performance is slightly lower than that of HNPNR,
MEBTS and AEBTS in most cases on NIST SRE 2002. It can also be
observed that energy based SADs are relatively better than the
first two techniques. Though these two techniques are very
useful for speech coding and compression, they are less
effective in speaker recognition. This is most likely
because those techniques consider a signal frame
as speech if it has some speech-like information, whereas the
energy based SADs consider a frame as speech only if
it has a significant amount of energy. Although low energy speech
frames may carry speech information, their contribution to
speaker recognition seems to be negligible.

The overall energy of the frames increases when the speech
is distorted by noise in adverse conditions. In that case, the
energy of the frame alone seems to be unreliable for speech
activity detection. Therefore, it is worth studying the effect of
the SAD on speaker recognition in noisy conditions. This study is
carried out on both databases and is discussed in the
following subsection.

B. Results in Mismatch Condition 

The experiments on noisy conditions are performed by
synthetically adding noise to the test utterances. The noise
samples are taken from the NOISEX-92 database and
down-sampled to 8 kHz. The noise is added to the speech signal
using the following steps:

• A segment of the original noise signal is randomly selected
with the same length as the speech signal to which the noise
is to be added.

• The amplitude of the noise segment is scaled according to
the desired SNR.

• The scaled noise signal is added to the clean signal to
get the distorted speech.
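The three steps above can be sketched as follows. The sketch assumes that signal and noise power are measured as the mean square over the whole utterance and the selected segment, which the text does not specify; the function name and the `rng` argument are hypothetical.

```python
import math
import random


def add_noise(speech, noise, snr_db, rng=random):
    """Add a randomly positioned, rescaled noise segment to a speech signal."""
    if len(noise) < len(speech):
        raise ValueError("noise recording is shorter than the speech signal")
    # Step 1: pick a random noise segment as long as the speech signal.
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in segment) / len(segment)
    if noise_power == 0.0:
        raise ValueError("selected noise segment is silent")
    # Step 2: gain g such that
    # 10 * log10(speech_power / (g^2 * noise_power)) equals snr_db.
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    # Step 3: add the scaled noise to the clean signal.
    return [s + gain * n for s, n in zip(speech, segment)]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.05 * n) for n in range(1000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(3000)]
    noisy = add_noise(clean, noise, snr_db=10.0, rng=rng)
    print(len(noisy))
```

Passing a seeded `random.Random` instance makes the choice of noise segment reproducible across runs.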

In our experiments, we have chosen five different noise
samples: white, pink, volvo, babble and factory (Factory-1
noise in the original database). We have chosen those noises

http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html
TABLE II

Speaker verification results in terms of equal error rate and minimum detection cost function on NIST SRE 2001 and NIST SRE
2002 for different implementations of speech activity detector. [G.729B: SAD used in G.729 codec [9], SMSAD: Statistical Model
Based SAD [10], HNPNR: Hungarian Phoneme Recognizer [25], MEBTS: Maximum Energy Based Threshold Selection [4], AEBTS:
Average Energy Based Threshold Selection [5], UBGME: Utterance-Wise Bi-Gaussian Modeling of Log Energy [28].]

         NIST SRE 2001                   NIST SRE 2002
         EER (in %)      minDCF x 100    EER (in %)      minDCF x 100
Method   MFCC    OBTC    MFCC    OBTC    MFCC    OBTC    MFCC    OBTC
G.729B   9.32    8.59    4.11    3.85    10.16   9.83    4.50    4.38
SMSAD    8.97    8.11    3.92    3.74    9.86    9.85    4.51    4.29
HNPNR    8.54    7.76    3.74    3.48    9.09    8.78    4.43    4.35
MEBTS    8.24    7.27    3.58    3.44    9.09    8.58    4.45    4.10
AEBTS    7.96    7.21    3.61    3.49    9.45    8.59    4.44    4.22
UBGME    7.91    7.36    3.72    3.41    9.69    9.05    4.50    4.28


due to their different frequency domain behavior, as shown in
Fig. 2. The experiments are conducted for three different levels
of noise: high (0 dB), medium (10 dB) and low (20 dB).
The results are shown in Table III and Table IV for the two
databases.

The performance of the speaker recognition system drops
severely in the presence of noise for all types of SAD. The
degradation in performance depends on the frequency
response of the noise. For example, from Fig. 2 we can
see that, because the frequency response of white noise
is flat, all the frequency components of the speech signal are
affected. Therefore, the performance is worst for this noise.
On the other hand, volvo noise affects only the first few mel
filterbank outputs. Hence, performance in the presence of volvo
noise is degraded less than for other noises at the
same SNR.

We also note that performance varies significantly across
SAD techniques in the presence of noise. For
G.729B and SMSAD, the performances are nearly equivalent
on both databases. The HNPNR based voice activity detector,
which is used in state-of-the-art speaker recognition systems for
speech frame selection, performs poorly compared to the other
techniques. The energy thresholded SADs, i.e.
MEBTS and AEBTS, suffer severely in the presence of noise; their
performance is even worse than that of G.729B and SMSAD.
In the presence of noise, all the speech frames are affected,
i.e. the energy of each frame is increased and the frequency
response of every frame is severely distorted. In this
scenario, maximum energy or average energy based threshold
selection is not very effective. Hence, only
vowel-like regions seem to be relevant in those cases [5].
The two-Gaussian model based approach selects the energy
threshold as the boundary between the speech and non-speech
classes, which for the most part selects vowel-like higher energy
frames. Therefore, the performance of this technique is
significantly better in higher noise.

C. Results in Real-time Scenario 

In Section IV-A, results are shown for the clean condition,
whereas in Section IV-B speaker recognition results are
shown for noisy conditions where, in every case, all the test
speech segments are distorted with the same noise at the same SNR.
However, in real life, this kind of controlled environment does
not occur. In practice, speech utterances are
distorted with different types of noise at various SNRs. In order
to observe the performance of speaker recognition in this kind
of situation, we have distorted the speech files of NIST
SRE 2002 with different noises at various SNRs. The noise type
is randomly chosen from the set of five noises used in the
previous experiments. The SNR is randomly selected between
0 and 40 dB. We call the distorted dataset Distorted NIST
SRE 2002. As different speech files are collected under different
environmental conditions, score normalization should be very
effective here [33]. We have applied t-normalization
to the raw log-likelihood scores to generate the final scores. The
utterances for t-normalization are chosen from the training speech
files of NIST SRE 2001, i.e. 74 male and 100 female speech
files are selected for t-normalization.
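A minimal sketch of t-normalization is shown below: the raw score of a test segment against the claimed model is shifted and scaled by the mean and standard deviation of the scores that the same segment obtains against the cohort models. The use of the population standard deviation and the function name are assumptions made for illustration.

```python
import statistics


def t_norm(raw_score, cohort_scores):
    """Normalize a raw score with the statistics of the cohort scores.

    cohort_scores holds the scores of the same test segment against the
    cohort (t-norm) speaker models.
    """
    mean = statistics.mean(cohort_scores)
    std = statistics.pstdev(cohort_scores)
    if std == 0.0:
        raise ValueError("cohort scores have zero spread")
    return (raw_score - mean) / std


if __name__ == "__main__":
    # Hypothetical log-likelihood ratio scores for one test segment.
    print(t_norm(3.0, [1.0, 2.0, 3.0]))
```

Because the cohort scores are computed on the same, possibly noisy, test segment, the normalization compensates for score shifts that depend on the test condition.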

V. Conclusion & Future Work 

In this work, we briefly review some standard SAD techniques
and their effects on speaker recognition performance.
The performance of the different techniques is evaluated on
two NIST corpora: NIST SRE 2001 and NIST SRE 2002.
SAD techniques like G.729B and SMSAD, which are widely
accepted in speech coding and other applications, are
shown to exhibit lower performance than even a simple energy
based speech activity detector. The experimental results show
that a speaker recognition system with two-Gaussian modeling
of log-energy based SAD is significantly better than other
techniques over a wide range of SNRs. We have evaluated the
performance using two cepstral features: standard MFCC and
our previously proposed OBTC. It has been shown that the OBTC
based system is superior to the MFCC based one in most cases.

The performance of the UBGME based SAD appears to be
suboptimal in the cases shown. However, this utterance-wise
bi-Gaussian modeling of log-energy based approach can be
further improved by investigating the following subjects:

• When a speech utterance is distorted by noise, its
different frequency bands are unequally affected (Fig. 2).
Therefore, two-Gaussian modeling of subband information
could be used to improve the performance of the
speech activity detector.

• The UBGME based approach uses only log-energy. However,
bi-Gaussian modeling of other parameters such as
entropy and spectral flatness can be studied to
extract robust speech frames.

• A decision fusion technique [3] can be used to improve
the performance further by combining multiple speech
activity detectors carrying supplementary information.

Fig. 2. Average of short-term magnitude responses for clean speech and noise samples. The speech signals are taken from NIST SRE 2001, whereas the noise
samples are taken from the NOISEX-92 database. The averaging operation is performed over all available signal frames.
TABLE III

Speaker verification results on NIST SRE 2001 in the presence of various additive environmental noises. The results are shown for
various speech activity detection techniques.

                  EER (in %)                                minDCF x 100
                  MFCC                 OBTC                 MFCC                 OBTC
Method   Noise    20 dB  10 dB  0 dB   20 dB  10 dB  0 dB   20 dB  10 dB  0 dB   20 dB  10 dB  0 dB
G.729B   White    11.38  16.69  28.56  10.40  15.65  27.28  4.80   7.17   9.84   4.52   7.12   9.84
         Pink     10.70  13.64  23.46  9.52   12.90  22.77  4.59   6.45   9.37   4.32   6.05   9.04
         Volvo    11.30  11.43  13.14  10.60  10.50  12.07  4.96   5.03   6.21   4.73   4.71   5.22
         Babble   9.96   13.74  23.99  9.43   12.70  22.93  4.63   6.52   9.17   4.13   5.75   8.84
         Factory  10.55  13.35  20.96  9.57   12.41  19.47  4.66   6.33   8.96   4.36   5.84   8.58
SMSAD    White    11.24  15.75  27.09  9.91   14.78  27.49  4.79   7.00   9.38   4.39   6.98   9.61
         Pink     10.12  13.35  23.60  9.18   12.32  22.62  4.52   6.34   9.17   4.18   5.92   9.02
         Volvo    10.85  11.13  13.20  10.16  10.26  11.97  4.79   5.05   5.89   4.62   4.55   5.11
         Babble   10.16  13.82  24.92  8.92   11.83  22.87  4.62   6.44   9.22   4.08   5.55   8.86
         Factory  10.06  13.05  21.19  9.27   11.63  19.28  4.55   6.03   8.70   4.28   5.67   8.42
HNPNR    White    11.19  16.44  34.64  9.81   15.02  34.49  4.99   7.41   10.00  4.?0   7.35   10.00
         Pink     9.87   13.15  29.94  8.83   12.41  28.70  4.63   6.46   10.00  4.30   6.21   10.00
         Volvo    9.67   9.86   12.32  8.93   9.33   10.84  4.62   4.65   5.80   4.16   4.21   4.98
         Babble   9.43   12.95  24.14  8.54   11.68  23.70  4.29   6.06   9.69   3.92   5.64   9.92
         Factory  9.62   12.37  ... 8.73 ... 10.00 4.52 ... 8.06 ... 8.29 ... 4.08 ... 3.85 ...
speaker recognition," Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 19, no. 7, pp. 2026-2038, Sept. 2011.

[22] R. Venkatesha Prasad, A. Sangwan, H. Jamadagni, M. Chiranth, R. Sah,
and V. Gaurav, "Comparison of voice activity detection algorithms for
VoIP," in Computers and Communications, 2002. Proceedings. ISCC
2002. Seventh International Symposium on, 2002, pp. 530-535.

[23] F. Beritelli, S. Casale, G. Ruggeri, and S. Serrano, "Performance
evaluation and comparison of G.729/AMR/fuzzy voice activity detectors,"
Signal Processing Letters, IEEE, vol. 9, no. 3, pp. 85-88, Mar. 2002.

[24] K. El-Maleh and P. Kabal, "Comparison of voice activity detection
algorithms for wireless personal communications systems," in Electrical
and Computer Engineering, 1997. Engineering Innovation: Voyage of
Discovery. IEEE 1997 Canadian Conference on, vol. 2, May 1997, pp.
470-473.

[25] P. Schwarz, "Phoneme recognition based on long temporal context,"
Ph.D. dissertation, Brno University of Technology, 2009.

[26] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny,
"Comparison of scoring methods used in speaker recognition with joint
factor analysis," in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2009), Apr. 2009, pp. 4057-4060.

[27] M. J. Alam, T. Kinnunen, P. Kenny, P. Ouellet, and D. O'Shaughnessy,
"Multitaper MFCC and PLP features for speaker verification using
i-vectors," Speech Communication, no. 0, pp. -, 2012, article in press.

[28] I. Magrin-Chagnolleau, G. Gravier, and R. Blouet, "Overview of the
2000-2001 ELISA consortium research activities," in 2001: A Speaker
Odyssey, The Speaker Recognition Workshop, 2001.

[29] M. Przybocki and A. Martin, "2001 NIST speaker recognition evaluation
corpus," Linguistic Data Consortium, 2002.

[30] The NIST Year 2002 Speaker Recognition Evaluation Plan. [Online].
Available:
http://www.itl.nist.gov/iad/mig/tests/spk/2002/2002-spkrec-evalplan-v60.pdf

[31] M. Sahidullah and G. Saha, "Design, analysis and experimental
evaluation of block based transformation in MFCC computation for speaker
recognition," Speech Communication, vol. 54, no. 4, pp. 543-565, May
2012.

[32] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification
using adapted Gaussian mixture models," Digital Signal Processing,
vol. 10, no. 1-3, pp. 19-41, 2000.

[33] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization
for text-independent speaker verification systems," Digital Signal
Processing, vol. 10, no. 1-3, pp. 42-54, 2000.



TABLE IV

Speaker verification results on NIST SRE 2002 in the presence of various additive environmental noises. The results are shown for
various speech activity detection techniques.

Method   Noise   EER (in %)   minDCF x 100

... 11.43 ... 5.45 7.06 9.40 5.20 6.65 ...
White 12.26 ... 27.86 11.97 ... 28.43 5.59 1.57 9.81 5.33 7.51 9.76
... 5.65 ... 15.66 ... 11.47 14.95 ... 5.34 ... 6.55
... 17.90 35.20 ... 17.40 35.40 5.85 8.08 10.00 5.62 7.96
... 11.53 15.96 31.48 ... 15.22 30.44 5.69 7.59 ... 5.49
... 10.29 19.01 33.02 5.15 8.38 9.88 4.87
... 11.19 ... 9.82 10.06 11.20 4.82 ... 4.69 5.24
... 10.60 17.20 32.41 ... 15.46 31.48 5.07 7.59 9.82 4.61 7.12 9.82
... 17.67 ... 17.36 ... 5.07 ... 10.00 5.57
... 10.52 ... 34.16 10.09 18.64 32.55 4.98 8.42 9.97 4.76 8.22 9.91
Volvo 10.16 ... 9.76 10.22 11.22 4.73 5.09 5.78 4.60 4.76 5.16
... 10.56 15.52 32.35 9.86 14.21 31.41 4.88 7.07 9.91 4.59 6.59 9.82
... 16.93 32.58 9.76 16.45 30.51 4.90 7.70 9.88 4.64 7.29 9.77
... 11.70 16.06 26.92 11.10 15.99 26.95 5.29 7.02 9.71 5.05 7.04 9.61
... 14.45 ... 10.50 13.95 23.24 5.08 6.52 9.29 4.85 6.29 9.10
... 10.69 11.46 12.77 10.16 ... 11.26 4.88 5.15 5.83 ... 4.78 5.31
Babble 10.76 14.29 23.10 10.16 13.21 23.13 4.99 6.43 9.16 4.60 6.13 9.05
Factory 10.76 14.25 24.57 10.19 13.48 24.04 5.00 6.54 9.38 4.70 6.24 9.14


TABLE V

Speaker verification results on Distorted NIST SRE 2002 for different speech activity detectors.

         w/o t-norm                      with t-norm
         EER (in %)      minDCF x 100    EER (in %)      minDCF x 100
Method   MFCC    OBTC    MFCC    OBTC    MFCC    OBTC    MFCC    OBTC

... 21.69 19.41 8.52 8.05 21.22 18.54 8.14 7.53
... 22.09 19.51 8.65 8.16 21.35 ... 8.37 7.72
... 8.75 ... 20.72 18.64 8.14 7.49
... 22.79 9.94 9.93 23.57 22.26 9.96 9.97
... 23.33 21.89 9.96 9.93 22.90 21.45 9.97 9.96
... 19.14 17.53 8.32 8.03 18.34 16.43 7.88 7.38




